21st May 2022
Author: Sebastian Heß

Data Science: Airbnb Munich - Machine Learning Analysis.

This project was part of my Udacity Data Scientist Nanodegree. It helped me improve my skills, knowledge and experience in data science, machine learning and writing a data science blog post (My Github Repository).

p_1%20Kopie.jpg


Table of Contents

CRISP-DM (Cross Industry Process for Data Mining)

  1. Business Understanding
    1. Introduction
  2. Data Understanding
    1. Gathering
    2. Assessing
  3. Preparing Data
    1. Cleaning
  4. Model Data
    1. Answering the Questions
  5. Evaluate the Results
    1. Conclusions
  6. Deploy

CRISP-DM (Cross Industry Process for Data Mining)

1. Business Understanding

This means understanding the problem and questions you are interested in tackling in the context of whatever domain you're working in. Examples include:

How do we acquire new customers? Does a new treatment perform better than an existing treatment? How can we improve communication? How can we improve travel? How can we better retain information?

2. Data Understanding

At this step, you need to move the questions from Business Understanding to data. You might already have data that could be used to answer the questions, or you might have to collect data to get at your questions of interest.

Here we used the StackOverflow data to attempt to answer our questions of interest. We did 1. and 2. in tandem in this case, using the data to help us arrive at our questions of interest. This is one of two methods that are common in practice. The second common method is to have certain questions you are interested in answering, and then to collect data related to those questions.

3. Prepare Data

Gathering Data

Luckily Stack Overflow has already collected the data for us. However, we still need to wrangle the data in a way for us to answer our questions. The wrangling and cleaning process is said to take 80% of the time of the data analysis process. You will see that will hold true through this lesson, as a majority of the remaining parts of this lesson will be around basic data wrangling strategies.

This is commonly denoted as 80% of the process. You saw this especially when attempting to build a model to predict salary, and there was still much more you could have done. We worked with missing data and found a way to work with categorical variables, but we didn't even look for outliers or attempt to find points we were especially poor at predicting. There was a ton more we could have done to wrangle the data, but you have to start somewhere, and then you can always iterate.

4. Model Data

We were finally able to model the data, but we had some back and forth with step 3. before we were able to build a model that had okay performance. There still may be changes that could be done to improve the model we have in place. From additional feature engineering to choosing a more advanced modeling technique, we did little to test that other approaches were better within this lesson.

5. Evaluate the Results

Results are the findings from our wrangling and modeling. They are the answers you found to each of the questions.

6. Deploy

Deploying can occur by moving your approach into production or by using your findings to persuade others within a company to act on the results. Communication is a very important part of the role of a data scientist.


1. Business Understanding

1.A. Introduction

I live in Munich, and when I travel to other places and countries, I mostly use the Airbnb platform. I always ask myself the same questions, as probably everyone in my position does. Therefore, I would like to answer these questions for my hometown Munich based on a data set provided by "Insideairbnb.com".

  1. Will I quickly find a suitable apartment in Munich?
  2. Which district is the most chosen one based on the best ratings?
  3. Which apartments are not attractive at all in Munich?
  4. What features are contributing to estimate the price of an apartment?

Data Source:

1.A.1. Imported Libraries and Modules

  • os: a Python module that provides functions for interacting with the operating system.
  • NumPy: a Python library for working with arrays.
  • pandas: a Python library for data manipulation and analysis.
  • pd.pandas.set_option( ): a pandas option used to display all of the columns in a data frame.
  • folium: a Python library for visualising interactive maps.
  • Matplotlib: a Python library for data visualization.
  • %matplotlib inline: a magic function for inline plotting of graphs.
  • sklearn.linear_model: a module of scikit-learn containing classes for machine learning with linear models.
  • sklearn.model_selection: a module of scikit-learn used to split the data into train and test data.
  • sklearn.metrics: a module of scikit-learn providing scoring metrics, used here for a continuous response variable.
  • seaborn: a Python library for data visualization based on Matplotlib.
  • sns.set_style: defines the seaborn figure aesthetic as "darkgrid".
  • dataframe_image: saves pictures of data frames.

1.A.2. Helper Functions

Defining helper functions for the Analysis.


2. Data Understanding

2.A. Gathering

Reading the data set listings.csv into the Pandas data frame df.
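
A minimal sketch of this step, assuming the file is a regular comma-separated export. The helper name load_listings and the three sample rows are hypothetical; a small in-memory file stands in for listings.csv so the example runs on its own.

```python
import io

import pandas as pd


def load_listings(source):
    """Read a listings file (path or file-like object) into a data frame."""
    return pd.read_csv(source)


# Tiny in-memory stand-in for listings.csv (hypothetical rows).
sample_csv = io.StringIO(
    'id,name,price\n'
    '1,Cozy room,$45.00\n'
    '2,City loft,"$1,200.00"\n'
)

df = load_listings(sample_csv)
print(df.shape)
print(df.dtypes)
```

With the real file, the call would simply be load_listings("listings.csv").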

2.B. Assessing

Assessing the data set to get a first overview of the dimensions and a better understanding of the data and its statistics.

There are a lot of columns that are not necessary for the analysis and can be dropped.

The data includes NaN values.

There are NO duplicates within the data.

The check for NaN values is positive. The data contains NaN values.

Checking the NaN values in more detail to understand the data.

Extracting the columns that consist only of NaN values.

Function that helps to detect NaN values in a specific column.
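
The checks described above can be sketched as follows. This is an illustration on a toy data frame, not the notebook's original code: the helper name nan_share and the column calendar_updated are assumptions.

```python
import numpy as np
import pandas as pd


def nan_share(df, column):
    """Return the fraction of missing values in a single column."""
    return df[column].isnull().mean()


# Toy data frame (hypothetical values).
df = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "price": ["$10.00", "$20.00", None, "$40.00"],
    "calendar_updated": [np.nan] * 4,  # assumed example of an all-NaN column
})

has_nan = df.isnull().values.any()                        # any NaN at all?
n_duplicates = df.duplicated().sum()                      # fully duplicated rows
all_nan_columns = df.columns[df.isnull().all()].tolist()  # columns with only NaN

print(has_nan, n_duplicates, all_nan_columns)
print(nan_share(df, "name"))
```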

The number of categorical columns in the data is 33.

There is a lot of Airbnb data spread over Munich, which can be visualized on a geographical map.

Assessing Summary

  1. The data frame consists of 74 columns and 4995 rows.
  2. The data frame contains NaN values.
  3. The data frame contains 4 columns whose values are all null.
  4. The data frame contains missing values.
  5. The data frame does NOT contain duplicate values.
  6. The data frame contains data types which need to be changed.
  7. The data frame contains columns/variables which need to be dropped.
  8. Some values of the data frame have to be imputed.
  9. Imputing the values of the column name does NOT make sense since the values are strings.
  10. Imputing the values of the column host_since does NOT make sense since the values are fixed dates.
  11. Imputing the values of the column host_total_listings_count does NOT make sense since the values are precise quantitative information.
  12. Imputing the values of the column bathrooms_text does NOT make sense since the values are precise information about the furnishing.
  13. The values of the column price have to be converted from USD to EUR.

3. Preparing Data

3.A. Cleaning

Cleaning the data set to identify and fix any issues such as incorrect, inaccurate, incomplete, incorrectly formatted, duplicated or irrelevant data.


3.A.1. Dropping unneeded columns

Dropping of irrelevant columns.

Code

Test

All of the irrelevant columns are dropped.
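
As an illustration of the Code/Test pattern used in this section, here is a small sketch. The columns listing_url and scrape_id are assumed examples of irrelevant columns, not necessarily the ones dropped in the original notebook.

```python
import pandas as pd

# Toy data frame (hypothetical values).
df = pd.DataFrame({
    "id": [1, 2],
    "listing_url": ["u1", "u2"],  # assumed irrelevant column
    "scrape_id": [10, 10],        # assumed irrelevant column
    "price": ["$45.00", "$80.00"],
})

# Code: drop the columns that are not needed.
columns_to_drop = ["listing_url", "scrape_id"]
df_clean = df.drop(columns=columns_to_drop)

# Test: none of the dropped columns may remain.
remaining = set(columns_to_drop) & set(df_clean.columns)
print(remaining)
```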


3.A.2. Selecting necessary columns and creating final data set

Creating a new data frame of all relevant columns.

Code

Test

The new data frame with all relevant columns is created.


3.A.3. Changing the data types

Changing the data types of some variables.

Code

Test

The new data type is in place.
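
A short sketch of such a type change, assuming host_since is stored as a date string and should become a datetime column. Which columns were actually converted in the notebook is not stated, so this choice is an assumption.

```python
import pandas as pd

# Toy data frame (hypothetical values).
df = pd.DataFrame({"host_since": ["2015-03-01", "2019-11-23"]})

# Code: convert the string column to datetime.
df["host_since"] = pd.to_datetime(df["host_since"])

# Test: inspect the resulting data type.
print(df["host_since"].dtype)
```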


3.A.4. Reworking string values

Reworking the string values of some variables.

Code

Test

The string value has been adjusted.


3.A.5. Changing the currency from USD to EUR

Replacing parts of the string values, changing the data type and converting the currency.

Code

Test

The string values, data types and the currency values are changed.
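
The three steps can be sketched like this. The exchange rate USD_TO_EUR is an assumed placeholder value, not the rate used in the original notebook, and the sample prices are hypothetical.

```python
import pandas as pd

USD_TO_EUR = 0.88  # assumed exchange rate, for illustration only

# Toy data frame (hypothetical values).
df = pd.DataFrame({"price": ["$45.00", "$1,200.00"]})

# Remove the currency symbol and the thousands separator, then cast to float.
price_usd = (
    df["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Convert to EUR and round to cents.
df["price"] = (price_usd * USD_TO_EUR).round(2)
print(df)
```

Passing regex=False matters here, because "$" would otherwise be interpreted as a regular-expression anchor.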


3.A.6. Imputing numerical and categorical values

Imputing all missing/NaN numerical and categorical values of the data frame.

Code

Test

All missing/NaN numeric and categorical values of the data frame were imputed. There are no more NaN values.
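
One common way to do this, shown here as a sketch under assumptions: numerical columns are filled with the column mean and categorical columns with the most frequent value. The notebook does not state its exact strategy, and the helper name impute_missing and the sample columns are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_missing(df):
    """Fill numeric NaNs with the column mean, other NaNs with the column mode."""
    df = df.copy()
    for col in df.columns:
        if df[col].isnull().all():
            continue  # nothing to learn from an all-NaN column
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df


# Toy data frame (hypothetical values).
df = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 3.0],
    "room_type": ["Private room", None, "Private room"],
})

df_imputed = impute_missing(df)
print(df_imputed)
print(df_imputed.isnull().sum().sum())
```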

Defining new Data Frame

Exporting new Data Frame

  • Finally, the cleaned data frame df_new can be used for modeling the data.

4. Model Data

4.A. Answering the Questions.

Analyse the data to answer the questions.


Question: 1. Will I quickly find a suitable apartment in Munich?


Defining a new data frame of the data to group it and count the quantity.


Defining the relative values [%] of the data.
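
The grouping and the relative values can be sketched as follows, assuming the feature is stored in a column named instant_bookable with the values "t" and "f". Both the column name and the toy values are assumptions.

```python
import pandas as pd

# Toy data frame (hypothetical values).
df = pd.DataFrame({"instant_bookable": ["t", "f", "f", "f"]})

# Count the listings per category.
counts = df.groupby("instant_bookable").size()

# Relative values in percent.
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```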


Plotting a pie chart of the booking possibility and saving it as a picture.

Q1 - Answer: The answer is no. It is not possible to quickly find an apartment in Munich. Only 28 % of the offered apartments are available immediately.



Question: 2. Which district is the most chosen one based on the best ratings?


Defining a new data frame of the data.


Defining a new data frame of the data.

Filtering and creating a new data frame of the data.


Grouping the data frame and counting the number of apartments per neighbourhood.


Filtering the data frame to get the first three values for the neighbourhood.
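
A sketch of the filter, group and top-three steps. The column names neighbourhood_cleansed and review_scores_rating are assumptions, and the districts and scores below are placeholders rather than real data.

```python
import pandas as pd

# Toy data frame (placeholder districts and scores).
df = pd.DataFrame({
    "neighbourhood_cleansed": [
        "District A", "District A", "District A", "District A",
        "District B", "District B",
        "District C", "District C",
        "District D",
    ],
    "review_scores_rating": [5.0, 5.0, 5.0, 4.2, 5.0, 5.0, 5.0, 3.9, 4.0],
})

# Keep only the listings with the best rating.
top_rated = df[df["review_scores_rating"] >= 5.0]

# Count per neighbourhood and keep the three largest groups.
top_three = (
    top_rated.groupby("neighbourhood_cleansed")
    .size()
    .sort_values(ascending=False)
    .head(3)
)
print(top_three)
```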


Defining the colors per neighbourhood.


Creating an interactive map to visualize each apartment of the filtered values.


Saving the interactive map as an HTML file.



Question: 3. Which apartments are not attractive at all in Munich?


Defining a new data frame of the data.


Visualizing the distribution of the Airbnb user review scores.


Counting the old data frame with filtering everything but the worst review scores.


Counting the amount of apartments while filtering the worst review scores.


Calculating the ratio compared to the total number of apartments.
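
The filter and the ratio can be sketched as below. The column name review_scores_rating and the toy scores are assumptions. The last lines redo the arithmetic with the figures reported in the conclusions (68 of 4,995 apartments).

```python
import pandas as pd

# Toy data frame (hypothetical scores).
df = pd.DataFrame({
    "review_scores_rating": [4.8, 1.0, 5.0, 2.0, 4.5, 3.0, 4.9, 4.7, 4.6, 5.0]
})

# Keep only the worst review scores (2.00 or less).
worst = df[df["review_scores_rating"] <= 2.0]
n_worst = len(worst)

# Share of the worst-rated apartments in percent.
ratio_percent = round(n_worst / len(df) * 100, 1)
print(n_worst, ratio_percent)

# Same calculation with the reported figures: 68 of 4,995 apartments.
reported_ratio = round(68 / 4995 * 100, 1)
print(reported_ratio)
```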


Helper function that sets the right colors for the markers within the map.


Creating an interactive map to visualize each apartment of the filtered values.


Saving the interactive map as an HTML file.



Question: 4. What features are contributing to estimate the price of an apartment?


Defining a new data frame of the data.


Checking and plotting the histograms of every numerical variable of the data.


Checking and plotting the price distribution of the apartments.


Checking and plotting the neighbourhood district of the apartments correlated to the price.


Checking and plotting the room types of the apartments correlated to the price.


Checking and plotting the price distribution of the apartments correlated to the rating score.


Defining a new data frame of the data.


Function to help to create the data frame including dummies as a preparation for the machine learning model.


Creating a new data frame including the dummies for all categorical variables.
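
A sketch of such a helper based on pd.get_dummies. The function name create_dummy_df, its parameters and the sample rows are assumptions. Dropping the first level of each variable is one possible design choice and may differ from the original notebook.

```python
import pandas as pd


def create_dummy_df(df, cat_cols, dummy_na=False):
    """Replace each categorical column by its 0/1 dummy columns."""
    df = df.copy()
    for col in cat_cols:
        dummies = pd.get_dummies(
            df[col],
            prefix=col,
            prefix_sep="_",
            drop_first=True,    # avoid one redundant column per variable
            dummy_na=dummy_na,  # optionally add a column for missing values
        )
        df = pd.concat([df.drop(columns=col), dummies], axis=1)
    return df


# Toy data frame (hypothetical values).
df = pd.DataFrame({
    "price": [100.0, 40.0, 20.0],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
})

cat_cols = df.select_dtypes(exclude="number").columns
df_dummies = create_dummy_df(df, cat_cols)
print(df_dummies.columns.tolist())
```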


Function to create the machine learning model that can be used to estimate what influences the apartment price.


Checking the R-squared of the machine learning model on the train data and the test data.
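
A minimal sketch of fitting the model and comparing both scores, using scikit-learn's LinearRegression, train_test_split and r2_score. The function name fit_linear_model, the split parameters and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_linear_model(X, y, test_size=0.3, random_state=42):
    """Fit a linear regression and return the model with train and test R-squared."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression()
    model.fit(X_train, y_train)
    train_score = r2_score(y_train, model.predict(X_train))
    test_score = r2_score(y_test, model.predict(X_test))
    return model, train_score, test_score


# Synthetic data (hypothetical features and prices).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "accommodates": rng.integers(1, 7, 200),
    "bedrooms": rng.integers(1, 4, 200),
})
y = 20 * X["accommodates"] + 15 * X["bedrooms"] + rng.normal(0, 5, 200)

model, train_score, test_score = fit_linear_model(X, y)
print(round(train_score, 3), round(test_score, 3))
```

A train score far above the test score is the usual sign of overfitting.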

The big delta between the R-squared of the test data and the train data shows that the machine learning model is OVERFITTING.
Conclusion: It is necessary to optimize the machine learning model.


Function to OPTIMIZE the R-squared of the machine learning model on the train data and the test data, considering cutoffs.


Calculating the number of columns and the best R-squared of the test data and the training data.
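
One way to read "considering cutoffs", sketched under assumptions: a feature is kept only if its column sum exceeds the cutoff (for 0/1 dummies, the category occurs more often than the cutoff), and one model is fitted per cutoff. The function name find_best_cutoff, the cutoff values and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def find_best_cutoff(X, y, cutoffs, test_size=0.3, random_state=42):
    """Fit one linear model per cutoff and collect (n_columns, test R-squared)."""
    results = {}
    for cutoff in cutoffs:
        # Keep only the columns whose sum exceeds the cutoff.
        reduced_X = X.loc[:, X.sum() > cutoff]
        if reduced_X.shape[1] == 0:
            continue  # nothing left to fit
        X_train, X_test, y_train, y_test = train_test_split(
            reduced_X, y, test_size=test_size, random_state=random_state
        )
        model = LinearRegression().fit(X_train, y_train)
        test_r2 = r2_score(y_test, model.predict(X_test))
        results[cutoff] = (reduced_X.shape[1], test_r2)
    best_cutoff = max(results, key=lambda c: results[c][1])
    return best_cutoff, results


# Synthetic data: two frequent dummies drive the price, ten rare ones are noise.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "district_a": rng.integers(0, 2, n),
    "room_entire": rng.integers(0, 2, n),
})
for i in range(10):
    rare = np.zeros(n, dtype=int)
    rare[[2 * i, 2 * i + 1]] = 1  # each rare dummy occurs exactly twice
    X[f"rare_{i}"] = rare
y = 60 + 40 * X["district_a"] + 30 * X["room_entire"] + rng.normal(0, 10, n)

best_cutoff, results = find_best_cutoff(X, y, cutoffs=[0, 5, 50])
for cutoff, (n_cols, test_r2) in results.items():
    print(cutoff, n_cols, round(test_r2, 3))
```

Removing rarely occurring dummies reduces the number of parameters, which is one simple way to narrow the gap between train and test R-squared.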


Function to calculate the coefficients of the X-features to see which ones matter the most in the machine learning model. It is necessary because a rich regression model is used.


Calculating the coefficients of the X-features and printing a list for further investigation.
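
A sketch of such a coefficient table for a fitted scikit-learn linear model. The function name coef_weights, the feature names and the toy prices are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def coef_weights(model, feature_names):
    """Return the coefficients sorted by absolute size, largest first."""
    coefs = pd.DataFrame({
        "feature": list(feature_names),
        "coef": model.coef_,
    })
    coefs["abs_coef"] = coefs["coef"].abs()
    return coefs.sort_values("abs_coef", ascending=False).reset_index(drop=True)


# Toy data (hypothetical): price rises with bedrooms, falls with a shared bath.
X = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4, 5, 6],
    "shared_bath": [0, 1, 0, 1, 1, 0],
})
y = 50 + 10 * X["bedrooms"] - 25 * X["shared_bath"]

model = LinearRegression().fit(X, y)
coefs = coef_weights(model, X.columns)
print(coefs)
```

Coefficients are only directly comparable when the features are on similar scales, which is the case for 0/1 dummy columns.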


5. Evaluate the Results

5.A. Conclusions


Question: 1. Will I quickly find a suitable apartment in Munich?

Sometimes I make a last-minute decision to travel somewhere. If it is a bigger city, I always ask myself whether it is possible to find an apartment 2 to 3 days before departure. Ideally, of course, in the most interesting district of town, at an appropriate price and in good quality.

In big cities I expect, first of all, a huge variety of suitable offers and, second, a big market that leads to appropriate prices.

But what is it actually like in the city where I live? In Munich.

plot_a1.png

Well, there are a lot of Airbnb apartments available in Munich. Nothing new, I guess. To be precise, there were 4,995 apartments registered in the data set (date compiled: 24 December 2021) I chose for my analysis.

plot_q1.png

Based on the "instant bookable" feature, the result of the analysis is clear.

It is not possible to find an apartment within a few days.

But there is always luck.
Anyway, I suggest you plan your trip to Munich a few days earlier.



Question: 2. Which district is the most chosen one based on the best ratings?

Usually I like to travel without a car. I love driving, but I like to discover cities on foot. For Munich, I would like to know whether I would have to travel far from the Munich central station to reach my final destination. So are there districts near the central station that have a good user score rating? Let's have a look at the result in the following interactive map.

  • Ludwigsvorstadt-Isarvorstadt
    144 registered apartments


  • Au-Haidhausen
    106 registered apartments


  • Schwabing-West
    101 registered apartments



And in fact there are some apartments next to the central station with a very good user score rating (a score of 5.00, the highest rating).

  • But the majority is located in the closest district, called "Ludwigsvorstadt-Isarvorstadt".
  • Followed by "Au-Haidhausen" on the other side of the river Isar.
  • And a little bit north of the central station there is "Schwabing-West", not far away from the park called "Englischer Garten".

plot_q2.png



Question: 3. Which apartments are not attractive at all in Munich?

Now we come to the black sheep. Who does not know them? They are everywhere. I definitely would not want to rent an unattractive or even dirty apartment. And I definitely want to interact with a reliable and communicative host.

Let us have a first look at the distribution of Airbnb user review scores in Munich.

Luckily, most of the review scores are between 4.00 and 5.00.
(Please note that the review score of 5 is the highest rating.)

plot_q3_1.png

And there they are, the outliers. They naturally exist in every data set, group, flock or swarm.
Considering that user review scores around 3.00 are still a satisfactory evaluation, I decided to filter only the data with user review scores equal to or less than 2.00.

As a result I found 68 apartments with a very bad rating.
Compared to the overall number of apartments in Munich, this is a negligible share of 1.4 %.

Anyway, if you do not like incredible adventures, you should choose your Airbnb apartment wisely.

plot_q3_2.png



Question: 4. What features are contributing to estimate the price of an apartment?

Originally, I wanted to analyze the price across the seasons by using machine learning. But unfortunately the data did not include a price that changed over time. So I decided to look at the features that help estimate the price of an apartment. It is getting a little bit more serious now as you probably noticed.

Counting the number of apartments per price makes it clear that there are some apartments with really cheeky high prices. Here they are again, the outliers.

However, the result shows that most of the registered apartments are in the acceptable price range.

plot_q4_2-4.png

More interesting is the price distribution of the Airbnb apartments by districts.
This results in the following short list:

Most expensive districts:

  1. Altstadt-Lehel
  2. Schwanthalerhöhe
  3. Ludwigsvorstadt-Isarvorstadt

Least expensive districts:

  1. Aubing-Lochhausen-Langwied
  2. Milbertshofen-Am Hart

plot_q4_3-3.png

Looking at the price distribution of the Airbnb apartments by room type, it is obvious that the following categories have the higher prices on average.

Higher prices:

  • Entire home / Entire apartment
  • Hotel room

Whereas the following categories have the lowest average prices for renting an Airbnb apartment in Munich:

Lower prices:

  • Private room
  • Shared room

Interestingly, one thing stands out when it comes to the price spread. The following categories have the biggest price range, up to 500 € per rental:

Biggest price range:

  • Entire home / Entire apartment
  • Private room

plot_q4_4-2.png

Before I present the results of my machine learning model, I wanted to look again at the distribution of user score ratings in relation to the price, using a scatter plot.
The result is similar to the previous histogram, with one big difference.

Here there is clear evidence that it is possible to find a well-rated Airbnb apartment at an appropriate price in Munich.

Why? Because most of these tiny dots are concentrated between user score ratings of 4.00 and 5.00 and under a price of 100 € per apartment.

We will not talk about the outliers again. They are everywhere.

plot_q4_5.png

Based on the results of the linear regression machine learning model I used for my analysis, the following features determine how high the price for an Airbnb apartment in Munich will be:

  • District
  • Type and number of bathrooms

plot_q4_6.png

Conclusion

For visiting Munich and renting an apartment on Airbnb you should:

1. Book your apartment well in advance.

2. Consider the following districts first to find a good-quality apartment:

  • Ludwigsvorstadt-Isarvorstadt
  • Au-Haidhausen
  • Schwabing-West

3. Avoid booking an apartment with a user rating score of 2.00 or less.

4. Choose an apartment you feel comfortable with, according to your personal preferences.

Most expensive districts:

  • Altstadt-Lehel
  • Schwanthalerhöhe
  • Ludwigsvorstadt-Isarvorstadt

Least expensive districts:

  • Aubing-Lochhausen-Langwied
  • Milbertshofen-Am Hart

Higher priced room type:

  • Entire home / Entire apartment
  • Hotel room

Lower priced room type:

  • Private room
  • Shared room

Biggest price ranged room type:

  • Entire home / Entire apartment
  • Private room

6. Deploy

My Github Repository

MEDIUM LINK !!!

neuschwanstein-g3f199ebd2_1280.jpg